-
Notifications
You must be signed in to change notification settings - Fork 883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Seed Dataset to improve compatibility and simplify usage #1734
base: master
Are you sure you want to change the base?
Conversation
initialized from HF/Pytorch/JSON/list of Dicts, remove the need for setup call and subsequently cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @apokryphosx, I left some comments.
test coverage
instead of strings and add seed for reproducibility
between simply skipping invalid datapoints in a seed dataset and throwing an exception
seed dataset to ensure they are defined before the other functions are
to store data in Seed Dataset
changes in base.py
requirements
getitem and cast len(data) to a Sized to pass mypy tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @apokryphosx, left some comments.
), "Dataset does not support indexing." | ||
|
||
self._raw_data = [dict(data[i]) for i in range(len(data))] | ||
elif isinstance(data, Path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add some more safety features here. The current configuration does not:
- check if the list contains only dictionaries
- check if the file is valid JSON before attempting to load
- handle potential encoding issues
validating each against the DataPoint schema. | ||
This class can initialize from Hugging Face Datasets, | ||
PyTorch Datasets, JSON file paths, or lists of dictionaries, | ||
converting them into a consistent internal format. | ||
""" | ||
|
||
def __init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This __init__
is getting really messy. It would be better to create specific methods to handle the conversion.
logger.debug("No raw data to process") | ||
return | ||
|
||
def create_datapoint( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is supposed to validate whether a dict is a valid DataPoint
. Providing default values for non-existant Keys defeats the whole purpose.
Note that it should not fail if there are extra fields.
Description
This PR solves:
#1732
#1733
Checklist
Go over all the following points, and put an
x
in all the boxes that apply.Fixes #issue-number
in the PR description (required)pyproject.toml
andpoetry.lock
If you are unsure about any of these, don't hesitate to ask. We are here to help!